Skip to content

feat: zarr3 #220

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 16 commits into
base: main
Choose a base branch
from
Open

feat: zarr3 #220

wants to merge 16 commits into from

Conversation

floriankrb
Copy link
Member

@floriankrb floriankrb commented Feb 26, 2025

Description

Anemoi-datasets should be agnostic to the version of zarr. It should run with zarr3 installed or with zarr2 installed.

We should also take into account that zarr2 code cannot read datasets created by zarr3. So, ideally, we should
1 - update anemoi-datasets to work with zarr2 and zarr3
2 - keep using zarr2 to build datasets (dependency anemoi-datasets[create] on zarr<=2, but have zarr2 or 3 when reading (dependency of anemoi-datasets on zarr
3 - when user have updated their environment (6 months?), start building zarr3 datasets

This PR mostly addresses the first point, making anemoi-datasets detect the version of zarr and adapt to it.
What is still missing is:

  • some work on src/anemoi/datasets/data/stores.py, we had performance/crashing issues when reading on S3 buckets, this is difficult to reproduce or benchmark. Moreover, the inferface to have custom stores has changed between zarr2 and zarr3, the feature we used is not available (yet?).
  • zarr3 misses some support for datetime data which is a feature that was present in zarr2 and we use.

📚 Documentation preview 📚: https://anemoi-datasets--220.org.readthedocs.build/en/220/

@github-actions github-actions bot added tests enhancement New feature or request labels Feb 26, 2025
@codecov-commenter
Copy link

codecov-commenter commented Feb 27, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.85%. Comparing base (80de4c6) to head (1459e83).
Report is 62 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #220      +/-   ##
==========================================
+ Coverage   72.96%   73.85%   +0.88%     
==========================================
  Files          10       10              
  Lines         825      872      +47     
==========================================
+ Hits          602      644      +42     
- Misses        223      228       +5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@github-actions github-actions bot added the dependencies Pull requests that update a dependency file label Feb 27, 2025
@anaprietonem
Copy link
Contributor

anaprietonem commented Jun 3, 2025

@floriankrb what was the conclusion in terms of adding/updating to zarr 3?

@floriankrb
Copy link
Member Author

@anaprietonem I updated the description of the PR

@floriankrb floriankrb mentioned this pull request Jun 23, 2025
@frazane
Copy link
Contributor

frazane commented Jun 25, 2025

Just adding a small comment here. You've probably considered this already but here we go.

The transition to Zarr v3 specification comes with a nice opportunity: we could use the sharding feature! It would allow us to have chunked variables (avoiding having to load them all in memory before taking a subset) while still retaining very high read speeds. Basically we would make each timestamp a shard, and then chunk variables inside the shard. The downside is slower write speed, but it’s not very important.

More info:

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Jun 26, 2025
@b8raoult
Copy link
Collaborator

b8raoult commented Jun 26, 2025

All tests are now passing with zarr2 and zarr3. At the moment of writing, zarr3 still does not support datetime64 (zarr-developers/zarr-python#2616)

Tests are much more slower with zarr3, this will require more investigation. For example test_slice_4 takes 98s with zarr2 (1.5 minutes) and 612s with zarr3 (10 minutes)

The profiler shows for zarr2:

   773147    0.339    0.000   95.489    0.000 .../anemoi-datasets/src/anemoi/datasets/data/stores.py:148(__getitem__)
   773165    1.553    0.000   95.152    0.000 .../python3.12/site-packages/zarr/core.py:656(__getitem__)

and for zarr3:

   773147    0.449    0.000  615.551    0.001 .../anemoi-datasets/src/anemoi/datasets/data/stores.py:148(__getitem__)
   773165    2.088    0.000  615.118    0.001 .../python3.12/site-packages/zarr/core/array.py:2294(__getitem__)

The test uses an in-memory zarr store. The test calls __getitem__ 770k time. With a time per call of 0.001s, we have 0.001 x 770000 is 770, which is in the ballpark of 615.

Copy link

@tjhunter tjhunter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@floriankrb @b8raoult this is fantastic, thank you for the hard work! We will try it this week or next and get back to you.


def __delitem__(self, key: str) -> None:
"""Prevent deletion of items."""
raise NotImplementedError()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this operation is illegale, it is not a not implementer error

"""Prevent deletion of items."""
raise NotImplementedError()

def __setitem__(self, key: str, value: bytes) -> None:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

def __getitem__(self, key: str) -> bytes:
"""Retrieve an item from the store and print debug information."""
# print()
print("GET", key, self)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am putting linter rules in wegen to disable all print statements

from typing import Any
from typing import Optional

import zarr

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think you could move this import into create_array and put a version check there (you already have it in zarr_2_or_3)

def create_array(zarr_root, *args, **kwargs):
if "compressor" in kwargs and kwargs["compressor"] is None:
# compressor is deprecated, use compressors instead
kwargs.pop("compressor")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: no need for checking if the key is there. you can do kwargs.pop("x", None)

import numpy as np

if dtype == "datetime64[s]":
dtype = np.dtype("int64")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

np.int64?

else:
LOG.warning("⚠️" * 80)
LOG.warning(
f"Only Zarr version 2 is supported when creating datasets, found version: {zarr.__version__}"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should not support writing in zarr3 format for the time being, only reading. Is there a use case for it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dependencies Pull requests that update a dependency file documentation Improvements or additions to documentation enhancement New feature or request tests
Projects
Status: Now In Progress
Development

Successfully merging this pull request may close these issues.

6 participants